17-3-7

Introduction

  • We show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles.
  • We compare and contrast plots that follow these principles to those that don't.

Introduction

  • The principles are mostly based on research related to how humans detect patterns and make visual comparisons.
  • The preferred approaches are those that best fit the way our brains process visual information.
  • When deciding on a visualization approach, it is also important to keep our goal in mind.
  • We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables.

Encoding data using visual cues

  • We start by describing some principles for encoding data.
  • There are several approaches at our disposal including:
    • position
    • aligned lengths
    • angles
    • area
    • brightness
    • color hue.

First example

Encoding data with angles and areas: not recommended

Encoding data with just area: even less recommended

Pie charts vs. barplots

If forced to make a pie chart, add percentages

Visual cues

  • Position and length are the preferred ways to display quantities, followed by angles, which are in turn preferred to area.

  • Brightness and color are even harder to quantify than angles and area but, as we will see later, they are sometimes useful when more than two dimensions are being displayed.

When to include 0

  • When using length (e.g. barplots) it is misleading not to start the bars at 0.

  • This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed.

  • By avoiding 0, relatively small differences can be made to look much bigger than they actually are.

  • This approach is often used by politicians or media organizations trying to exaggerate a difference.
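
To see how much a truncated axis can exaggerate a difference, the sketch below compares the true ratio of two quantities with the ratio of the bar lengths a reader would see when the axis starts above 0. The two values and the baseline of 34 are hypothetical numbers chosen for illustration, not data from the examples in these slides.

```r
# Hypothetical quantities and axis start, for illustration only
values <- c(a = 35, b = 39.6)
baseline <- 34  # where the truncated y-axis begins

# Ratio of the quantities themselves
true_ratio <- values[["b"]] / values[["a"]]

# Ratio of the drawn bar lengths when the bars start at `baseline`
apparent_ratio <- (values[["b"]] - baseline) / (values[["a"]] - baseline)

round(c(true = true_ratio, apparent = apparent_ratio), 2)
```

With these numbers, b is about 13% larger than a, yet its bar is drawn more than five times as long.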

Example

  • (Source: Fox News, via Media Matters via Fox News via Peter Aldhous)

Same data with plot that includes 0

Another example

Same data with plot that includes 0

When not to include 0

  • When using position rather than length, it is not necessary to include 0.

  • This is particularly the case when we want to compare differences between groups relative to the variability seen within the groups.

Example: Life expectancy by continent in 2012

Do not distort quantities

Use area not radius

But we should not be using area or radius

Order by a meaningful value

Order by a meaningful value

Another example: Average income by region in 1970 (colors = continent)

Show the data

Show the data

Jitter and alpha blending

Compare distributions if too many points

Ease comparisons: Use common axes

Ease comparisons: align vertically

Ease comparisons: align horizontally

Comparison

Consider transformations

  • As an example, consider this barplot showing the average population sizes for each continent in 2015:

Consider transformations

Show the data

Consider transformations

How can we ease comparisons?

Comparisons should be adjacent

Use color to highlight comparison

Think of the color blind

  • About 10% of the population is color blind.
  • Unfortunately, the default colors used in ggplot are not optimal for this group.
  • However, ggplot does make it easy to change the color palette used in the plots.

Think of the color blind

  • Here is an example of how we can use a color blind friendly palette:
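
One way to do this in ggplot2 is sketched below. The eight hex codes are a widely circulated color blind friendly palette; that this is the same palette the slide originally linked to is an assumption. The data frame `d` is a hypothetical placeholder used only to display the colors.

```r
library(ggplot2)

# A commonly used color blind friendly palette (assumed, not confirmed by the slides)
color_blind_friendly_cols <- c(
  "#999999", "#E69F00", "#56B4E9", "#009E73",
  "#F0E442", "#0072B2", "#D55E00", "#CC79A7"
)

# Hypothetical data: one point per color, just to show the palette
d <- data.frame(x = 1:8, y = 1, group = factor(1:8))

# scale_color_manual() replaces the default ggplot2 colors with our own
p <- ggplot(d, aes(x, y, color = group)) +
  geom_point(size = 5) +
  scale_color_manual(values = color_blind_friendly_cols)

print(p)
```

The same vector can be passed to `scale_fill_manual()` when the colors encode a fill rather than a point or line color.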

Think of the color blind

Think of the color blind

  • There are several resources that help you select colors, for example this one.

Alternatives to scatterplots

  • Scatter plots are the default when comparing two variables.
  • Slope charts are good for before-and-after comparisons.
  • Bland-Altman plots are good when we care about the differences.
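
As a minimal sketch of the idea behind a Bland-Altman plot, the code below plots the difference between two paired measurements against their average, so the quantity we care about sits directly on the y-axis. The `before` and `after` vectors are hypothetical values made up for illustration.

```r
# Hypothetical paired measurements, for illustration only
before <- c(70.1, 72.4, 68.9, 75.0, 71.3)
after  <- c(71.0, 74.1, 69.5, 77.2, 72.0)

# Bland-Altman coordinates: average on the x-axis, difference on the y-axis
ba <- data.frame(
  average    = (before + after) / 2,
  difference = after - before
)

plot(ba$average, ba$difference,
     xlab = "Average of before and after",
     ylab = "Difference (after - before)")
abline(h = 0, lty = 2)  # reference line for "no change"
```

Compared with a scatter plot of `after` against `before`, the differences are encoded by position along a common axis rather than by distance from a diagonal line.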

Slope charts

Scatter plot (with common axes)

Bland-Altman plot

Encoding a third variable

Encoding a third variable

Example

  • The data used for these plots were collected, organized and distributed by the Tycho Project.
  • They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states.

One state is easy

Palettes

  • Diverging colors are used to represent values that diverge from a center.
  • We put equal emphasis on both ends of the data range: higher than the center and lower than the center.
  • An example of when we would use a divergent pattern would be if we were to show height in standard deviations away from the average.
  • Here are some examples of divergent patterns:

Sequential Palettes
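
As a rough sketch, the sequential palettes that ship with RColorBrewer can be displayed and extracted as follows. The choice of the "Blues" palette and of nine colors is an arbitrary assumption for illustration.

```r
library(RColorBrewer)

# Show every sequential palette that ships with RColorBrewer
display.brewer.all(type = "seq")

# Extract the hex codes of one sequential palette
# (arbitrary choice: nine shades of "Blues")
blues <- brewer.pal(9, "Blues")
blues
```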

Divergent Palettes

# Show every diverging palette that ships with RColorBrewer
library(RColorBrewer)
display.brewer.all(type = "div")

Example

Alternative: eliminate one variable

Avoid pseudo three dimensional plots

(Source: Karl Broman)

Avoid pseudo three dimensional plots

Avoid gratuitous three dimensional plots

  • Pseudo 3D is sometimes used completely gratuitously: plots are made to look 3D even when the 3rd dimension does not represent a quantity.
  • This only adds confusion and makes it harder to relay your message.

Avoid gratuitous three dimensional plots

Avoid gratuitous three dimensional plots

Avoid too many significant digits

  • By default, statistical software like R returns many significant digits.
  • The default behavior in R is to show 7 significant digits.
  • So many digits often add no information, and the added visual clutter can make it hard for the consumer of your table to understand the message.
  • As an example, here are the per 10,000 disease rates for California across five decades:

Avoid too many significant digits

state year Measles Pertussis Polio
California 1940 37.8826320 18.3397861 18.3397861
California 1950 13.9124205 4.7467350 4.7467350
California 1960 14.1386471 0.0000000 0.0000000
California 1970 0.9767889 0.0000000 0.0000000
California 1980 0.3743467 0.0515466 0.0515466

Avoid too many significant digits

  • We are reporting precision up to 0.0000001 cases per 10,000, a very small value in the context of the changes that are occurring across the decades.
  • In this case, rounding to one decimal place is more than enough and clearly makes the point that rates are decreasing:

Avoid too many significant digits

state year Measles Pertussis Polio
California 1940 37.9 18.3 18.3
California 1950 13.9 4.7 4.7
California 1960 14.1 0.0 0.0
California 1970 1.0 0.0 0.0
California 1980 0.4 0.1 0.1
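
A minimal sketch of how the rounded table could be produced in base R is shown below. The rates are copied from the unrounded table in these slides; the object names `rates`, `disease_cols` and `rounded` are placeholders, and `round()` to one decimal place is assumed because it reproduces the values shown.

```r
# Rates per 10,000, copied from the unrounded table in the slides
rates <- data.frame(
  state = "California",
  year = c(1940, 1950, 1960, 1970, 1980),
  Measles = c(37.8826320, 13.9124205, 14.1386471, 0.9767889, 0.3743467),
  Pertussis = c(18.3397861, 4.7467350, 0.0000000, 0.0000000, 0.0515466),
  Polio = c(18.3397861, 4.7467350, 0.0000000, 0.0000000, 0.0515466)
)

# Round only the disease columns to one decimal place
disease_cols <- c("Measles", "Pertussis", "Polio")
rounded <- rates
rounded[disease_cols] <- round(rates[disease_cols], 1)

rounded
```

Rounding the stored values is only needed when the table itself is the product; for on-screen inspection, `options(digits = 3)` changes how many significant digits R prints without altering the data.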

Compare vertically

  • Another principle, related to displaying tables, is to place the values being compared in columns rather than rows.

Compare vertically

state year Measles Pertussis Polio
California 1940 37.9 18.3 18.3
California 1950 13.9 4.7 4.7
California 1960 14.1 0.0 0.0
California 1970 1.0 0.0 0.0
California 1980 0.4 0.1 0.1

Do not compare horizontally

state disease 1940 1950 1960 1970 1980
California Measles 37.9 13.9 14.1 1.0 0.4
California Pertussis 18.3 4.7 0.0 0.0 0.1
California Polio 18.3 4.7 0.0 0.0 0.1
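
If a table arrives in the horizontal layout, it can be rearranged so that the years being compared run down a column. The sketch below does this in base R by transposing the numeric part of the table; the object names `wide`, `m` and `tall` are placeholders, and the values are the rounded rates from the slides.

```r
# Horizontal layout: one column per year
wide <- data.frame(
  disease = c("Measles", "Pertussis", "Polio"),
  `1940` = c(37.9, 18.3, 18.3),
  `1950` = c(13.9, 4.7, 4.7),
  `1960` = c(14.1, 0.0, 0.0),
  `1970` = c(1.0, 0.0, 0.0),
  `1980` = c(0.4, 0.1, 0.1),
  check.names = FALSE  # keep the year column names as they are
)

# Transpose the numeric part: years become rows, diseases become columns
m <- t(as.matrix(wide[, -1]))
colnames(m) <- as.character(wide$disease)

tall <- data.frame(year = as.integer(rownames(m)), m, row.names = NULL)
tall
```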

Further reading:

  • ER Tufte (1983) The visual display of quantitative information. Graphics Press.
  • ER Tufte (1990) Envisioning information. Graphics Press.
  • ER Tufte (1997) Visual explanations. Graphics Press.
  • WS Cleveland (1993) Visualizing data. Hobart Press.
  • WS Cleveland (1994) The elements of graphing data. CRC Press.
  • A Gelman, C Pasarica, R Dodhia (2002) Let's practice what we preach: Turning tables into graphs. The American Statistician 56:121-130.
  • NB Robbins (2004) Creating more effective graphs. Wiley.
  • Nature Methods columns
  • A Cairo (2013) The Functional Art: An Introduction to Information Graphics and Visualization. New Riders.
  • N Yau (2013) Data Points: Visualization That Means Something. Wiley.